Import libraries

library(tidyr)      # for reshaping data
library(readr)      # parse_number example
library(dplyr)      # data wrangling/manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)    # visualization
library(gapminder)  # gapminder dataset
library(ggthemes)   # additional themes for ggplot2
library(ggrepel)    # helper package for ggplot2
library(kableExtra) # produce publication-ready tables
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(flextable)  # produce publication-ready tables
## 
## Attaching package: 'flextable'
## The following objects are masked from 'package:kableExtra':
## 
##     as_image, footnote
library(purrr)      # map function
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:flextable':
## 
##     compose
library(broom)      # tidy approach to statistical modeling

1 The tidyverse approach: Reshaping data with tidyr functions

In contrast to Part 1, data are not always nicely formatted for exploration and analysis. This means that before we clean and subet data, we actually have to reshape it first. We often want the format to be tidy, which makes the data ready for summarization, visualization, and analysis. Check out the Carpentries lesson on data formats to learn more.

Signs of messy datasets - Column headers are values, not variable names. - Multiple variables are not stored in one column. - Variables are stored in both rows and columns. - Multiple types of observational units are stored in the same table. - A single observational unit is stored in multiple tables.

Let’s take a look at the cases of untidy data.

1.1 Pipes

To utilize the power of the tidyverse, we will speak a slightly different dialect of R that uses pipes. “Pipes are a powerful tool for clearly expressing a sequence of multiple operations.” (Check out Chapter 18 in Wickham’s R for Data Science book to learn more).

The pipe operator %>% originally comes from the magrittr package. The idea behind the pipe operator is similar to what we learned about chaining functions in high school. f: B -> C and g: A -> B can be expressed as \(f(g(x))\). The pipe operator chains operations. When you read pipe operator, read as “and then” (Wickham’s recommendation). The keyboard shortcut is ctrl + shift + M. The key idea here is not creating temporary variables and focusing on verbs (functions).

2 We’ll learn more about this functional programming paradigm later on. (double check)

Pipes permit the user to accomplish more complex, and more varieties of, tasks in a more understandable way because the code can simply be read from left to right. The pipe takes the thing to the left of the pipe and “pipes” it into whatever is to the right of the pipe. This can continue in perpetuity until the task is accomplished.

Concept map for pipe operator. By Jeroen Janssens, Monica Alonso.

We will use pipes look at two main functions to reshape data: - pivot_longer() increases the number of rows (longer) and decreases the number of columns. The inverse function is - pivot_wider(). These functions improve the usability of gather() and spread(), which are now deprecated.

Two main differences between pivot_longer() and pivot_wider(): - In pivot_longer(), the arguments are named names_to and values_to (to). - In pivot_wider(), this pattern is opposite. The arguments are named names_from and values_from (from). - The number of required arguments for pivot_longer() is 3 (col, names_to, values_to). - The number of required arguments for pivot_wider() is 2 (names_from, values_from).

2.1 pivot_longer() - make it longer!

What pivot_longer() does (Source: https://www.storybench.org)

Let’s take a look at some cases of untidy data.

Messy Data Case 1 (Source: R for Data Science)

  • Make It Longer

    Col1 Col2 Col3

3 Challenge - messy data

  1. Why are these data not tidy?
table4a
# ?table4a

Let’s pivot (rotate by 90 degrees).

Concept map for pivoting. By Florian Schmoll, Monica Alonso.

Wha does pivot_longer() look like?

What pivot_longer() does (Source: https://www.storybench.org)

ready_4a <- table4a %>%
  pivot_longer(
    cols = c("1999", "2000"), # Selected columns
    names_to = "year", # Shorter columns (the columns going to be in one column called year)
    values_to = "cases"
  ) # Longer rows (the values are going to be in a separate column called named cases)
ready_4a
str(ready_4a)
## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
##  $ country: chr [1:6] "Afghanistan" "Afghanistan" "Brazil" "Brazil" ...
##  $ year   : chr [1:6] "1999" "2000" "1999" "2000" ...
##  $ cases  : int [1:6] 745 2666 37737 80488 212258 213766
  • There’s another problem, did you catch it?
  • The data type of year variable should be numeric not character. By default, pivot_longer() transforms uninformative columns to character.
  • You can fix this problem by using names_transform argument.
ready4a <- table4a %>%
  pivot_longer(
    cols = c("1999", "2000"), # Put two columns together
    names_to = "year", # Shorter columns (the columns going to be in one column called year)
    values_to = "cases", # Longer rows (the values are going to be in a separate column called named cases)
    names_transform = list(year = readr::parse_number)
  ) # Transform the variable
ready4a
str(ready4a)
## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
##  $ country: chr [1:6] "Afghanistan" "Afghanistan" "Brazil" "Brazil" ...
##  $ year   : num [1:6] 1999 2000 1999 2000 1999 ...
##  $ cases  : int [1:6] 745 2666 37737 80488 212258 213766

Note the style of readr::parse_number. This is called namespace resolution and is represented as package_name::function_name. This is a good habit to get into because you can always be explicit about which function you are using from which library!

Additional tips

parse_number() also keeps only numeric information in a variable.

# old way to call functions from libraries
library(readr)
parse_number("reply1994")
## [1] 1994

A flat file (e.g., CSV) is a rectangular shaped combination of strings. Parsing determines the type of each column and turns into a vector of a more specific type. Tidyverse has parse_ functions (from readr package) that are flexible and fast (e.g., parse_integer(), parse_double(), parse_logical(), parse_datetime(), parse_date(), parse_time(), parse_factor(), etc).

4 Challenge - pivot_longer()

  1. Why is this data not tidy? (This exercise comes from pivot function vigenette.) Too long or too wide?
billboard
  1. How can you fix it? Which pivot?
billboard_ready <- billboard %>%
  pivot_longer(
    cols = starts_with("wk"), # Use regular expressions
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE # Drop NAs
)
billboard_ready

4.1 pivot_wider() - make it wider!

Why is this data not tidy?

table2
  • Each observation is spread across two rows.

What pivot_wider() does (Source: https://www.storybench.org)

table2_ready <- table2 %>%
  pivot_wider(
    names_from = type, # first
    values_from = count # second
  )
table2_ready

4.2 Missing values

Sometimes, a consultee asks: “I don’t have missing values in my original dataframe. Then R said that I have missing values after I’ve done some data transformations. What happened?”

Here’s an answer.

R defines missing values in two ways. - Implicit missing values” simply not present in the data. - *Explicit missing values: flagged with NA

5 Challenge - explicit and implicit missing values

1 This example comes from R for Data Science. Where is the explicit missing value?

stocks <- tibble(
  year = c(2019, 2019, 2019, 2020, 2020, 2020),
  qtr = c(1, 2, 3, 2, 3, 4),
  return = c(1, 2, 3, NA, 2, 3)
)
stocks

Use pivot_wider() to widen the stocks dataframe into a datafrme named stocks_wide. Does stocks_wide have implicit missing values?

## YOUR CODE HERE

6 Wrangling and manipulating data with the dplyr package

Essentially, once data in the tidy format you can then wrangle and manipulate it. Remember those somewhat clumsy and repetitive operations we performed in Part 1?

This time we can use dplyr to expedite the data wrangling and manipulation process. For brevity’s sake, let’s load our clean Mesa, Arizona USA data that we saved at the end of Part 1.

load("data/preprocessed/clean.RData")
str(clean)
## 'data.frame':    96621 obs. of  7 variables:
##  $ date           : Factor w/ 1186 levels "2014-01-01","2014-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject_race   : Factor w/ 6 levels "asian/pacific islander",..: 3 6 6 2 6 6 6 6 6 3 ...
##  $ subject_sex    : Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 1 ...
##  $ subject_age    : int  36 58 19 46 27 58 24 27 27 29 ...
##  $ officer_id_hash: Factor w/ 770 levels "0027657d5f","003bde5bfa",..: 637 677 226 529 272 272 754 754 128 146 ...
##  $ violation      : Factor w/ 716 levels "2 OCCUPANTS UNDER 18 CLASS G LICENSE",..: 405 391 614 640 125 157 225 595 316 65 ...
##  $ arrest_made    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

6.1 arrange()

Sort values in ascending or descending order.

head(dplyr::arrange(clean, subject_age)) # Low to high (default))
head(dplyr::arrange(clean, desc(subject_age))) # High to low 
# or, use the minus symbol to specify descending order
head(dplyr::arrange(clean, -subject_age)) # Also high to low 

6.2 rename()

Change column names.

names(clean)
## [1] "date"            "subject_race"    "subject_sex"     "subject_age"    
## [5] "officer_id_hash" "violation"       "arrest_made"
clean <- clean %>%
  rename(
    race = # NEW name
    subject_race,  # OLD name
    sex = subject_sex, 
    age = subject_age
  ) # OLD name
names(clean)
## [1] "date"            "race"            "sex"             "age"            
## [5] "officer_id_hash" "violation"       "arrest_made"

6.3 filter()

Subset rows (single condition)

female_drivers <- clean %>%
  filter(sex == "female") %>%
  arrange(age)
head(female_drivers)

Subset rows (multiple conditions)

old_males <- clean %>%
  filter(sex == "male", age > 90) %>%
  arrange(-age)
head(old_males)

7 Challenge - filter()

  1. Use filter(between()) to subset a dataframe that contains only drivers between the ages of 10 and 13.
## YOUR CODE HERE

7.1 select()

Subset columns.

Select one or more columns (just “date” and “arrest_made”).

sub1 <- clean %>%
  select("date", "arrest_made")
head(sub1)

Select only numeric (or integer) columns

sub2 <- clean %>%
  dplyr::select(where(is.numeric))
head(sub2)

Select columns that contains the letter “s”

sub3 <- clean %>%
  dplyr::select(contains("s"))
head(sub3)

Select columns that start with the letter “a”

sub4 <- clean %>%
  dplyr::select(starts_with("a"))
head(sub4)

Create a new column

# the base R way
clean$new_col_name <- clean$age / 10
head(clean)
# the dplyr way
clean <- clean %>%
  mutate(new_dplyr = age * 10)
head(clean)

Delete columns

clean$new_col_name <- NULL
head(clean)
clean$new_dplyr <- NULL
head(clean)

7.2 count()

Count how many…

sub5 <- clean %>%
  count(sex, sort = TRUE)
sub5

7.3 group_by()

Perform grouping operation before including another operation

sexes <- clean %>%
  group_by(sex) %>%
  count()
sexes

7.4 summarize()

Create a new tibble with summary information. Calculate the average age of drivers by sex

sexes_age <- clean %>%
  group_by(sex) %>%
  # the text to the left of the equals sign is the new column name (avg_age)
  # the operation to the right performs the function (the average/mean)
  summarize(avg_age = mean(age, na.rm = TRUE))
sexes_age

Summarize by multiple conditions

sexes_age_race <- clean %>%
  group_by(sex, race) %>%
  summarize(avg_age = mean(age, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
sexes_age_race

Include mutiple computations

sexes_age_race <- clean %>%
  group_by(sex, race) %>%
  summarize(avg_age = mean(age, na.rm = TRUE), 
            count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
sexes_age_race

7.5 mutate()

Add a new column to an existing data frame with mutate()

Use case_when() like an extension of ifelse() in base R: https://dplyr.tidyverse.org/reference/case_when.html

clean <- clean %>%
  mutate(agegroup = case_when(age >= 80 ~ "Elderly",
                              age >= 18 ~ "Adult", 
                              age < 18 ~ "Adolescent"))
head(clean)
table(clean$agegroup)
## 
## Adolescent      Adult    Elderly 
##       1661      92343       1169

Or, add a simpler computation

clean <- clean %>%
  group_by(sex) %>%
  mutate(avg_age = mean(age, na.rm = TRUE))
head(clean)

7.6 Create publishable ready tables

# for HTML and LaTeX
sexes_age_race %>% kableExtra::kable()

# for HTML and MS Office Suite
sexes_age_race %>% flextable::flextable()

8 Challenge - combining group_by() and summarize()

  1. Create a new tibble that contains average age and standard deviation by sex and arrest made.
## YOUR CODE HERE

9 Data visualization with ggplot2

  • The following material is adapted from Kieran Healy’s excellent book (2019) on data visualization and Hadley Wickham’s equally excellent book on ggplot2. For more theoretical discussions, I recommend you to read The Grammar of Graphics by Leland Wilkinson.

  • Why should we care about data visualization? More precisely, why should we learn the grammar of statistical graphics?

  • Sometimes, pictures are better tools than words in 1) exploring, 2) understanding, and 3) explaining data.

9.1 Anscombe’s quartet motivating example

Anscombe’s quartet consists of four datasets, which are very similar in terms of their descriptive statistical properties (mean, variance, correlation, regression line, etc.) but are quite different when presented visually.

# Set a ggplot2 theme to get rid of the default gray background
theme_set(theme_minimal())

View the data

anscombe
# View correlations between x and y variables
cor(anscombe)[c(1:4), c(5:8)]
##            y1         y2         y3         y4
## x1  0.8164205  0.8162365  0.8162867 -0.3140467
## x2  0.8164205  0.8162365  0.8162867 -0.3140467
## x3  0.8164205  0.8162365  0.8162867 -0.3140467
## x4 -0.5290927 -0.7184365 -0.3446610  0.8165214

process the data, then plot. A little code can go a long way!

anscombe_processed <- anscombe %>%
  tidyr::gather(x_name, x_value, x1:x4) %>%
  tidyr::gather(y_name, y_value, y1:y4)
head(anscombe_processed)
# plot
# dont worry, we will cover what these different aspects of the grammar of graphic mean!
# also note that we can pipe directly into ggplot! More on this below.
anscombe_processed %>%
  ggplot(aes(x = x_value, y = y_value)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  facet_grid(x_name ~ y_name) +
  theme_bw() +
  labs(
    x = "X values",
    y = "Y values",
    title = "Anscombe's quartet"
  )
## `geom_smooth()` using formula 'y ~ x'

9.2 The grammar of graphics

The grammar of graphics is comprised of several themes. Remember to see the Wilke textbook linked in the readme file for gentle introductions.

- data
- aesthetic attributes (color, shape, size)
- geometric objects (points, lines, bars)
- stats (summary stats)
- scales (map values in the data space)
- coord (data coordinates)
- facet (facetting specifications)

9.3 Mapping

To map something means to link a variable with something you can see on a plot.

  • aes (aesthetic mappings or aesthetics) tells which variables (x, y) in your data should be represented by which visual elements (color, shape, size) in the plot.
  • geom_ tells the type of plot you are going to use (histogram, boxplots, scatterplot, etc.)

Produce the base layer. We have not yet specified a geom, so no data appear on the plot. Let’s look at a scatterplot (i.e., a visualization of two continuous varialbes - one on the x-axis and one on the y-axis) to see how these mapping pieces work.

Each layer is joined by a plus symbol +

# scatterplot example
p <- ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp)) 
p

# specify points as our geom, i.e., a scatterplot
p + geom_point()

# geom_smooth has calculated a smoothed line
# the shaded area is the standard error for the line
p + geom_point() + 
  geom_smooth() 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

9.4 Histogram

Plot a single variable’s distribution.

geom_histogram(): For the probability distribution of a continuous variable. Bins divide the entire range of values into a series of intervals (see the Wiki entry).

data(midwest) # load midwest dataset
midwest
midwest %>%
  ggplot(aes(x = area)) +
  # the stat_bin argument picks up 30 bins (or "buckets") by default.
  geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

midwest %>%
  ggplot(aes(x = area)) +
  # only 10 bins
  geom_histogram(bins = 10)

We can also directly pass in a subsetted dataset as the argument to the data = parameter

# subset just Ohio and Indiana
ggplot(
  data = subset(midwest, state %in% c("OH", "IN")),
  mapping = aes(x = percollege, fill = state)
) +
  # alpha adjusts the transparency
  geom_histogram(alpha = 0.7, bins = 20) +
  scale_fill_viridis_d()

9.5 Size and color

There’s also fill argument (mostly used in geom_bar()). Color aes affects the appearance of lines and points, fill is for the filled areas of bars, polygons, and in some cases, the interior of a smoother’s standard error ribbon.

Create a scatterplot with gdpPercap on the x-axis, lifeExp on the y-axis, and map point size to population size.

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop
  )
) +
  geom_point()

Add the color mapping and a viridis scale to make the plot more interpretable.

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop,
    color = continent
  )
) +
  geom_point() +
  scale_color_viridis_d()

Aesthetics also can be mapped per geom.

p + geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

p + geom_point(alpha = 0.2) + 
  # color only the line red
  geom_smooth(color = "red", se = FALSE, size = 2, method = "loess") 
## `geom_smooth()` using formula 'y ~ x'

Notice the different behavior based on where color is defined…

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    color = continent # color in the global mapping is different from...
  )
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") + # ... color in geom_smooth 
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  )
## `geom_smooth()` using formula 'y ~ x'

9.6 Coordinates and scales

Coordinates can be flipped

p + geom_point() +
  coord_flip() # coord_type

The data is heavily bunched up against the left side.

p + geom_point() # without scaling

p + geom_point() +
  # log 10 transform the x-axis
  scale_x_log10() + 
  geom_smooth(method = "lm", color = "green")
## `geom_smooth()` using formula 'y ~ x'

9.7 Labels and guides

scales package has some useful premade formatting functions. You can either load scales or just grab the function you need from the library using scales::

Refer back to the p base layer

p

p + geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  )
## `geom_smooth()` using formula 'y ~ x'

9.8 Themes and ggsave

Also note the handy “Export” button on your plotting pane.

p + geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  # include a nice background
  ggthemes::theme_clean()
## `geom_smooth()` using formula 'y ~ x'

9.9 Saving/exporting figures

# save your work to the working directory
ggsave("figure_example.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'

9.10 Multiple plots: Facetting and compound figures

Basic ideas:

  • Grouping: tell ggplot2 about the structure of your data
  • Facetting: break up your data into pieces for a plot

The importance of grouping

  • Can you guess what’s wrong here when we try to make a line plot?
p + geom_point()

p + geom_line()

geom_line joins up all the lines for each particular year in the order they appear in the dataset. ggplot2 does not know the yearly observations in your data are grouped by country.

Note that you need grouping when the grouping information you need to tell is not built into the mapped variables (like continent).

Facetting

Let’s try it in a facetted example. Facetting is to make small multiples; in our case, sub-plots based on a grouping variable like “continent”.

  • facet_wrap: based on a single categorical variable like facet_wrap(~single_categorical_variable). Your panels will be laid out in order and then wrapped into a grid.
  • facet_grid: when you want to cross-classify some data by two categorical variables like facet_grid(one_cat_variable ~ two_cat_variable).
p + geom_line(aes(group = country)) # group by, # The outlier is Kuwait.

p + geom_line(aes(group = country)) + facet_wrap(~continent) # facetting

p + geom_line(aes(group = country), color = "gray70") +
  geom_smooth(size = 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_wrap(~continent, ncol = 5) + # for single categorical variable; for multiple categorical variables use facet_grid()
  labs(
    x = "Year",
    y = "GDP per capita",
    title = "GDP per capita on Five continents"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## `geom_smooth()` using formula 'y ~ x'

Transforming

Transforming: perform some calculations on or summarize your data before producing the plot. Use pipes to summarize data

Let’s experiment with bar charts. By default, geom_bar uses stat = “bins”, which makes the height of each bar equal to the number of cases in each group. If you have a y column, then you should use stat = "identity" argument. Alternatively, you can use geom_col().

First, summarize mean gdpPercap and lifeExp

gapminder_formatted <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  )
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
gapminder_formatted
# plot! that looks much nicer :) 
ggplot(data = gapminder_formatted, aes(x = year, y = lifeExp_mean, color = continent)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy on Five continents"
  )

Or, facet by continent!

# plot! that looks much nicer :) 
ggplot(data = gapminder_formatted, aes(x = year, y = lifeExp_mean, color = continent)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy on Five continents"
  ) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

# Facet by country where continent is filtered by Europe
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  facet_wrap(~country)
## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

Or, use boxplots…

# Descending alphabetical sorting
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = country, y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

# reorder by ascending
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = reorder(country, lifeExp_mean), y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

# reorder by descending (note the minus sign)
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = reorder(country, -lifeExp_mean), y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

If you want to compound multiple plots into a single figure, whceck out the gridExtra, cowplot and patchwork R packages.

9.11 Plotting text

Label countries in Asia or the Americas only by filtering with the logical or | operator. Facet by continent.

gapminder %>%
  filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  geom_text(aes(label = country)) +
  scale_x_log10() +
  facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

Add a nice label…

gapminder %>%
  filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  geom_label(aes(label = country)) +
  scale_x_log10() +
  facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

… and avoid text overlaps.

gapminder %>%
  filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = country)) + # there's also geom_label_repel
  scale_x_log10() +
  facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

10 Challenge - ggplot2

  1. Use the load() function to load “oak.RData”.
## YOUR CODE HERE
load("data/preprocessed/oak.RData")
  1. Make boxplots for male and female drivers.
  • Define a custom y-axis with intervals of 0, 20, 40, 60, 80, and 100.
  • Fill the boxes with the “BuPu” RColorBrewer palette.
  • Apply the Stata ggtheme
## YOUR CODE HERE
ggplot(clean, 
       aes(x = sex, 
           y = age,
           fill = sex)) + 
  geom_boxplot() + 
  scale_y_continuous(breaks = c(0, 20, 40, 60, 80, 100),
                     limits = c(0, 100)) + 
  scale_fill_brewer(palette = "BuPu") + 
  ggthemes::theme_stata() + 
  theme(legend.position = "none")
## Warning: Removed 1448 rows containing non-finite values (stat_boxplot).

10.1 Ploting models

In plotting models, we extensively use David Robinson’s broom package in R. The idea is to transform model outputs (i.e., predictions and estimations) into tidy objects so that we can easily combine, separate, and visualize these elements.

Plotting several fits at the same time

Note the use of three geom_smooth() functions. What happens if you use a hashtag to comment out one or the other?

model_colors <- RColorBrewer::brewer.pal(3, "Set1") # select three qualitatively different colors from a larger palette.
gapminder %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS")) +
  geom_smooth(
    method = "lm", formula = y ~ splines::bs(x, df = 3),
    aes(color = "Cubic Spline", fill = "Cubic Spline")
  ) +
  geom_smooth(method = "loess", aes(color = "LOESS", fill = "LOESS")) +
  theme(legend.position = "top") +
  scale_color_manual(name = "Models", values = model_colors) +
  scale_fill_manual(name = "Models", values = model_colors)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Extracting model outcomes

Save a linear regression model in a variable named out.

# regression model
out <- lm(
  formula = lifeExp ~ gdpPercap + pop + continent,
  data = gapminder
)

tidy() is a method in the broom package. It “constructs a dataframe that summarizes the model’s statistical findings”. As the description states, tidy is a function that can be used for various models. For instance, a tidy can extract following information from a regression model.

  • Term: a term being estimated
  • p.value
  • statistic: a test statistic used to compute p-value
  • estimate
  • conf.low: the low end of a confidence interval
  • conf.high: the high end of a confidence interval
  • df: degrees of freedom

10.1.0.1 Coefficients and confidence intervals

# estimates
out_comp <- broom::tidy(out)

# construct the base layer
p <- out_comp %>%
  ggplot(aes(x = term, y = estimate))
p

# plot the estimates as points
p + geom_point() +
  coord_flip() +
  theme_bw()

10.1.0.2 Confidence intervals

# save the estimates plus their confidence intervals
out_conf <- broom::tidy(out, conf.int = TRUE)

# plot coefficients using geom_pointrange()
out_conf %>%
  ggplot(aes(x = reorder(term, estimate), y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip() +
  labs(x = "", y = "OLS Estimate") +
  theme_bw()

# plot coefficients with errorbars
out_conf %>%
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high,
                     height = .25)) +
  labs(y = "", x = "OLS Estimate") +
  theme_bw()

11 Statistical modeling with broom

11.1 Nesting

The following example comes from R for Data Science by Garrett Grolemund and Hadley Wickham. Nesting allow you to run multiple models simultaneously? Using a nested data frame. Check out this YouTube video to learn more: https://www.youtube.com/watch?v=rz3_FDVt9eg

You can think about nesting this way: - Grouped data: each row = an observation - Nested data: each row = a group

12 Challenge - nesting

  1. In the following example, why do we use country and continent for nesting variables?
nested <- gapminder %>%
  group_by(country, continent) %>%
  nest()
head(nested)
# Retrieve just the first list from the grouped data frame
nested$data %>% purrr::pluck(1)
  • Custom function
lm_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}
  • Apply function to the nested data
# Apply m_model to the nested data
nested <- nested %>%
  mutate(models = purrr::map(data, lm_model)) # Add the list object as a new column
head(nested)

S3 is part of R’s object-oriented systems. If you need further information, check out this section in Hadley’s Advanced R.

12.1 glance(), tidy(), and augment()

  • broom::glance(model): for evaluating model quality and/or complexity
  • broom::tidy(model): for extracting each coefficient in the model (the estimates + its variability)
  • broom::augment(model, data): for getting extra values (residuals, and influence statistics). A convenient tool in case if you want to plot fitted values and raw data together.

Check out this broom YouTube video to learn more: https://www.youtube.com/watch?v=7VGPUBWGv6g&ab_channel=Work-Bench

glanced <- nested %>%
  mutate(glance = map(models, broom::glance))
glanced
# Pluck the first item on the list 
glanced$glance %>% pluck(1)
# Pull just the p.value 
glanced$glance %>% pluck(1) %>% pull(p.value)
##        value 
## 9.835213e-08

unnest() unpacks the list objects stored in the glanced column

glanced %>%
  unnest(glance) %>%
  arrange(r.squared) 
glanced %>%
  unnest(glance) %>%
  ggplot(aes(continent, r.squared)) +
  geom_jitter(width = 0.5)

tidy()

Creates a tibble from a model’s statistical output

# View the nested structure
# we see the number of observations and other five columns in the column named "data"
nested <- gapminder %>%
  group_by(continent) %>%
  nest()
nested
# specify our model in a column named "models"
nested <- nested %>%
  mutate(models = map(data, ~lm(lifeExp ~ year + country, data = .))) 
nested
# tidy up!
tidied <- nested %>%
  mutate(tidied = map(models, broom::tidy))
tidied
# construct tibble of estimates and p-values
model_out <- tidied %>%
  unnest(tidied) %>%
  mutate(term = stringr::str_replace(term, "country", "")) %>%
  select(continent, term, estimate, p.value) %>%
  mutate(p_threshold = ifelse(p.value < 0.05, 1, 0))
model_out
# show countries with statistical insignificance
model_out %>% filter(p_threshold == 1) %>% pull(term) %>% unique()
##   [1] "(Intercept)"              "year"                    
##   [3] "Bahrain"                  "Bangladesh"              
##   [5] "Cambodia"                 "China"                   
##   [7] "Hong Kong, China"         "India"                   
##   [9] "Indonesia"                "Iran"                    
##  [11] "Iraq"                     "Israel"                  
##  [13] "Japan"                    "Jordan"                  
##  [15] "Korea, Dem. Rep."         "Korea, Rep."             
##  [17] "Kuwait"                   "Lebanon"                 
##  [19] "Malaysia"                 "Mongolia"                
##  [21] "Myanmar"                  "Nepal"                   
##  [23] "Oman"                     "Pakistan"                
##  [25] "Philippines"              "Saudi Arabia"            
##  [27] "Singapore"                "Sri Lanka"               
##  [29] "Syria"                    "Taiwan"                  
##  [31] "Thailand"                 "Vietnam"                 
##  [33] "West Bank and Gaza"       "Yemen, Rep."             
##  [35] "Austria"                  "Belgium"                 
##  [37] "Croatia"                  "Czech Republic"          
##  [39] "Denmark"                  "Finland"                 
##  [41] "France"                   "Germany"                 
##  [43] "Greece"                   "Iceland"                 
##  [45] "Ireland"                  "Italy"                   
##  [47] "Montenegro"               "Netherlands"             
##  [49] "Norway"                   "Poland"                  
##  [51] "Portugal"                 "Slovak Republic"         
##  [53] "Slovenia"                 "Spain"                   
##  [55] "Sweden"                   "Switzerland"             
##  [57] "Turkey"                   "United Kingdom"          
##  [59] "Angola"                   "Benin"                   
##  [61] "Botswana"                 "Burkina Faso"            
##  [63] "Burundi"                  "Cameroon"                
##  [65] "Central African Republic" "Chad"                    
##  [67] "Comoros"                  "Congo, Dem. Rep."        
##  [69] "Congo, Rep."              "Cote d'Ivoire"           
##  [71] "Djibouti"                 "Equatorial Guinea"       
##  [73] "Eritrea"                  "Ethiopia"                
##  [75] "Gabon"                    "Gambia"                  
##  [77] "Ghana"                    "Guinea"                  
##  [79] "Guinea-Bissau"            "Kenya"                   
##  [81] "Lesotho"                  "Liberia"                 
##  [83] "Madagascar"               "Malawi"                  
##  [85] "Mali"                     "Mauritania"              
##  [87] "Mauritius"                "Mozambique"              
##  [89] "Namibia"                  "Niger"                   
##  [91] "Nigeria"                  "Reunion"                 
##  [93] "Rwanda"                   "Senegal"                 
##  [95] "Sierra Leone"             "Somalia"                 
##  [97] "South Africa"             "Sudan"                   
##  [99] "Swaziland"                "Tanzania"                
## [101] "Togo"                     "Uganda"                  
## [103] "Zambia"                   "Zimbabwe"                
## [105] "Bolivia"                  "Brazil"                  
## [107] "Canada"                   "Colombia"                
## [109] "Dominican Republic"       "Ecuador"                 
## [111] "El Salvador"              "Guatemala"               
## [113] "Haiti"                    "Honduras"                
## [115] "Mexico"                   "Nicaragua"               
## [117] "Paraguay"                 "Peru"                    
## [119] "Puerto Rico"              "Trinidad and Tobago"     
## [121] "United States"            "Venezuela"               
## [123] "New Zealand"
# show countries with statistical significance
model_out %>% filter(p_threshold == 0) %>% pull(term) %>% unique()
##  [1] "Bosnia and Herzegovina" "Bulgaria"               "Hungary"               
##  [4] "Romania"                "Serbia"                 "Egypt"                 
##  [7] "Libya"                  "Morocco"                "Sao Tome and Principe" 
## [10] "Tunisia"                "Chile"                  "Costa Rica"            
## [13] "Cuba"                   "Jamaica"                "Panama"                
## [16] "Uruguay"